Fault-Tolerance Using Cache-Coherent Distributed Shared Memory Systems

Authors

  • Diana Hecht
  • Krishna M. Kavi
  • Rhonda Kay Gaede
  • Constantine Katsinis
Abstract

In this paper, we describe new protocols augmenting traditional cache-coherency mechanisms to implement fault tolerance based on Recovery Blocks and checkpointing. Concurrent processes complicate rollback recovery, since a rollback can potentially lead to a "domino effect" whereby the process is rolled back to the beginning. Several approaches have been proposed to limit the domino effect. One set of such techniques requires communicating processes to periodically synchronize in order to checkpoint a globally consistent state. These schemes can be implemented more naturally on distributed shared memory systems using synchronization on shared memory. We have developed extensions to well-known cache-coherency methods (e.g., directory-based) for the implementation of checkpointing
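
The idea of synchronizing periodically to checkpoint a globally consistent state can be illustrated with a minimal sketch. It is not the authors' cache-coherency protocol: a thread barrier stands in for synchronization on shared memory, a dictionary stands in for the shared memory itself, and the names `run_with_checkpoints`, `rollback`, and the parameters `num_workers`, `steps`, and `every` are hypothetical.

```python
import copy
import threading


def run_with_checkpoints(num_workers=3, steps=5, every=2):
    """Run workers that checkpoint together every `every` steps.

    Returns (shared, checkpoint): the final shared state and the last
    globally consistent snapshot. All names here are illustrative.
    """
    shared = {"counters": [0] * num_workers}  # stand-in for shared memory
    checkpoint = {"state": copy.deepcopy(shared)}  # initial recovery point

    def take_checkpoint():
        # Called by exactly one thread once all workers have reached the
        # barrier and before any of them is released, so no worker is
        # mid-update when the snapshot is taken.
        checkpoint["state"] = copy.deepcopy(shared)

    barrier = threading.Barrier(num_workers, action=take_checkpoint)

    def worker(i):
        for step in range(1, steps + 1):
            shared["counters"][i] += 1  # each worker updates its own slot
            if step % every == 0:
                barrier.wait()  # synchronize, then checkpoint

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared, checkpoint


def rollback(shared, checkpoint):
    """Restore the shared state from the last consistent checkpoint."""
    shared.clear()
    shared.update(copy.deepcopy(checkpoint["state"]))


if __name__ == "__main__":
    shared, checkpoint = run_with_checkpoints()
    shared["counters"][0] = -1  # simulate a fault after the last checkpoint
    rollback(shared, checkpoint)
    print(shared["counters"])
```

With the default arguments, checkpoints are taken after steps 2 and 4, so the rollback restores every counter to 4 rather than to 0. Because all workers checkpoint at the same synchronization point, recovery never has to cascade back past that point, which is the property that limits the domino effect.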

Similar resources

A Hierarchical Shared Memory Cluster Architecture with Load Balancing and Fault Tolerance

Recently, a great deal of attention has been paid to the design of hierarchical shared memory cluster systems. Cluster computing has made hierarchical computing systems increasingly common as a target environment for large-scale scientific computations. This paper proposes a hierarchical shared memory cluster architecture with load balancing and fault tolerance. Hierarchies of shared memory and cache...

Fault tolerance and configurability in DSM coherence protocols

With the advent of large networks and the demand to have uninterrupted service, computer systems need to be more robust and fault tolerant. There are numerous ways to implement fault tolerance and recovery. A central concept in all these methods is the requirement for replicated data for high data availability. We believe that a protocol must not only provide replication, but do so at low opera...

Design of a Simulator for Large-Scale Distributed Shared-Memory Cache-Coherent Architectures

As the scale and the complexity of parallel computer systems grow rapidly, the study of interactions between application algorithms and parallel architectures becomes more important. Execution-driven simulation under realistic workloads proves to be an accurate and efficient technique for studying the performance of computer systems. However, direct-execution simulation of shared-memory cache-coh...

Fault-Tolerant Distributed-Shared-Memory on a Broadcast-Based Interconnection Network

The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency, high-bandwidth interconnection network which directly links arbitrary pairs of processor nodes without contention, and can efficiently interconnect over one hundred nodes. Each node has a dedicated output channel and an array of receivers, with one receiver dedicated to every other node’s output channel. The SOME-...

Experiences with Distributed Shared Memory

A major problem with programming systems such as distributed memory multicomputers or networks of workstations has been the necessity for explicit, time-consuming and expensive message passing. Distributed shared memory enables such systems to appear to have a common memory though they may not physically share it. The Systems Architecture Research Centre at City University has worked on implem...

Publication date: 1999